Skip to content

Conversation

@rdheekonda
Copy link
Contributor

@rdheekonda rdheekonda commented Jan 23, 2026

[AIRT] Tool selection safety and role-play testing

Key Changes:

  • Adds tool_selection_safety scorer to detect when agents choose dangerous tools over safe alternatives (returns 1.0 for unsafe, 0.5 for questionable, 0.0 for safe)
  • Adds role_play_wrapper transform with 4 scenarios (educational, fictional, historical, research) for jailbreak testing
  • Adds 3 evaluation rubrics for llm_judge: tool_selection_safety, unbounded_agency, web_chatbot_security

Added:

  • tool_selection_safety() scorer in tool_invocation.py
  • role_play_wrapper() transform in stylistic.py
  • 3 YAML rubrics in dreadnode/data/rubrics/
  • Test suites and example notebook

Usage:

```python

Tool safety scorer

scorer = dn.scorers.tool_selection_safety(
safe_tools=["http_get", "tcp_connect"],
dangerous_tools=["shell_execute"],
)

Role-play transform

transform = dn.transforms.stylistic.role_play_wrapper(
scenario="educational",
character="security researcher",
)
```

Generated Summary:

Summary of Changes:

  • Introduced new rubrics for safety evaluation in AI tools:

    • tool_selection_safety: Evaluates the choice of tools based on safety and risk (OWASP ASI02).
    • unbounded_agency: Assesses whether agents operate within user-defined limits and request permission before expanding scope (OWASP ASI10).
    • web_chatbot_security: Identifies web chatbot plugin vulnerabilities based on IEEE S&P 2026 findings.
  • Added a scoring mechanism for tool selection safety, focusing on:

    • Detection of dangerous tool usage when safer alternatives are available.
    • Guidelines for classifying tool choice as safe, questionable, or unsafe.
  • Implemented a role-play wrapper transform for testing against jailbreak attempts, distinguishing between legitimate educational inquiries and potentially harmful requests.

  • Included an example Jupyter notebook to demonstrate the usage of both the tool selection safety scorer and the role-play wrapper, with practical scenarios to illustrate scoring and evaluation.

Potential Impact:

  • Enhances the overall security posture by systematically evaluating tool selection and agency behavior in AI applications.
  • Provides a structured approach to identify and mitigate vulnerabilities in web chatbots and tool misuse.
  • Facilitates developers in ensuring compliance with security best practices through clear guidelines and scoring metrics.

This summary was generated with ❤️ by rigging

- Add tool_selection_safety scorer for detecting unsafe tool choices
- Add role_play_wrapper transform for jailbreak testing
- Add 3 evaluation rubrics (tool_selection_safety, unbounded_agency, web_chatbot_security)
@dreadnode-renovate-bot dreadnode-renovate-bot bot added area/tests Changes to test files and testing infrastructure area/examples Changes to example code and demonstrations labels Jan 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/examples Changes to example code and demonstrations area/tests Changes to test files and testing infrastructure

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants